ABSTRACT

To save time and money, businesses and individuals have be- 
gun outsourcing their data and computations to cloud com- 
puting services. These entities would, however, like to ensure 
that the queries they request from the cloud services are be- 
ing computed correctly. In this paper, we use the principles 
of economics and competition to vastly reduce the complex- 
ity of query verification on outsourced data. We consider 
two cases: First, we consider the scenario where multiple 
non-colluding data outsourcing services exist, and then we 
consider the case where only a single outsourcing service ex- 
ists. Using a game theoretic model, we show that given the 
proper incentive structure, we can effectively deter dishonest 
behavior on the part of the data outsourcing services with 
very few computational and monetary resources. We prove 
that the incentive for an outsourcing service to cheat can 
be reduced to zero. Finally, we show through extensive
experimental evaluation that a simple verification method
can achieve this reduction.
1. INTRODUCTION

As the amount of data that we generate increases, so does 
the time and effort necessary to process and store the data. 
With an increase in time and effort comes an increase in 
monetary cost. To this end, many have turned to outsourc- 
ing their data processing to "the cloud." Cloud computing 
services are offered by many large companies, such as Amazon,
IBM, Microsoft, and Google, as well as smaller companies
such as Joyent and CSC. For example, Google [7]
recently launched the Google BigQuery Service, which is de- 
signed for exactly this purpose: outsourced data processing. 
The distributed nature of these cloud services shortens data 
processing time significantly, and offers a massive amount of 
storage. 
Figure 1: Data Outsourcing with Verification



In a perfect world, these cloud providers would impartially 
devote all the computation necessary to any task paid for 
by the subscribers. In such a world, the querying process 
would look like Figure 1 (minus the verifier), where the subscriber
outsources the data D to the cloud, and sends queries
(Q), and the cloud does the necessary calculations and returns
the result (Q(D)). However, a cloud provider is a
self-interested entity. Since it is very difficult for the users 
of the cloud to see the inner workings of the cloud service, 
a cloud provider could "cut corners," delivering a less ac- 
curate or incomplete computation result which would take 
fewer system resources to compute. This would, of course, 
save computational resources for the provider, provided the 
subscriber was unable to tell a false result from a true one. 
Because of this, query verification, or the assurance of query 
result correctness, has been identified as one of the major 
problems in data outsourcing [17].

Many techniques have been developed and employed for 
query verification. Query verification is the process of verifying
the authenticity of an outsourced query result. In Figure
1 above, the subscriber sends a query to the outsourcing
service, and receives a response. Query verification would 
then be another process where the subscriber determines if 
the response is, in fact, the result of the query. The ver- 
ification process may belong to the owner, or it may be 
another process entirely. In any case, the verifier aims to 
make sure that the outsourced server responded correctly. 
These verification techniques range from simple to extremely 
complex, and generally rely on the subscriber storing some 
sketch of the data (much smaller in size), or some cryp- 
tographic protocols. Such protocols do a good job verifying 
the data, but are often slow, or only work with specific types 
of queries. Many of them require that the subscriber know
which queries he will execute in advance, so that a sketch can
be created for each one. None of these, however, consider 
the heart of the problem: the self-interest of the parties. 



The problem of data outsourcing, and the resultant query
verification, is fundamentally a problem of incentives. A
cloud subscriber wants to get the result of his queries accurately
and efficiently, with as low a cost as possible. A
cloud provider, however, is most concerned about the profitable
use of its computing resources. These incentives can
be at odds with each other. The natural way of analyzing
competing incentives is to use game theory.

Game theory is a branch of economics which studies competitive
behavior between parties. An interaction between
parties is cast as a game, where players use strategy to maximize
their gains. The gains from an interaction can be
offset artificially by contracts, enforced by law. These adjustments
can make actions which were once profitable, e.g.,
"cutting corners" in a calculation, less profitable through the
use of penalties. The contracts, therefore, aim not to detect
whether a cloud provider is cheating, but to remove the incentive
for the provider to cheat altogether.

We propose a game theory-based approach to query verification
on outsourced data. We model the process of querying
outsourced data as a game, with contracts used to enforce
behavior. Data outsourcing does not take place in a vacuum.
Service Level Agreements (SLAs) exist for all types
of cloud services [14], and are contracts enforceable in court.
Thus, we can augment the SLA with an incentive structure
to encourage honest behavior. Using a very simple query
verification technique, we show that even the threat of verification
is enough to deter cheating by a cloud provider.

First, we consider the case where multiple, non-colluding
cloud providers exist. Non-colluding means that the cloud
providers do not share information. We feel this is realistic,
since cloud providers are competing entities and do not wish
to share data with the competition. In this scenario, we
show that without the use of special verification techniques,
a data owner can guarantee correct results from rational
cloud providers, while incurring an additional cost that is
only a small fraction of the overall computation cost.

We also consider the case where only a single cloud provider
is used. A data owner may wish to use only a single cloud
provider to save money, as they may not have the money to
hire multiple cloud services. In addition, a data owner may
simply wish to minimize the outside exposure of the data.
We choose to demonstrate our approach using the simple
random sampling query verification technique. This technique
was rejected in many earlier works, because it required
a relatively large storage overhead to achieve a close bound
on the sample result. For our approach, we do not need
incredibly close bounds on the result. We only need bounds
close enough to catch some mistakes. The simple random
sampling technique also has the added bonus that it can be
used to verify many different types of queries, including any
aggregates which can be estimated from a sample (count,
sum, average, standard deviation, median, quantiles, max,
min, etc.), and also estimate the size of selection queries,
allowing for some verification on those queries as well. Finally,
our method requires running the verification on only a fraction
of the queries, incurring a much lower expected runtime
than a full sample-based verification method.



Our contributions can be summarized as follows: 



• We develop a game theoretic model of query verifica- 
tion on outsourced data. 

• For both the multiple non-colluding cloud case and 
the single cloud case, we show that the model has an 
equilibrium where the cloud provider behaves honestly. 

• We show that a simple sampling technique, although 
rejected by other works, becomes practical in our single 
cloud setting. 

• Through extensive experimentation, we show that use 
of this simple sampling method, coupled with our in- 
centive structure, deters cheating in practice. 

• Finally, we show that our incentives can improve the 
expected runtime of any query verification method, 
making it extremely flexible.



Our paper does not consider the privacy of the outsourced 
data (similar to [s]). However, any privacy-preserving tech- 
nique for outsourcing data could still be used in our frame- 
work. The use of our game theoretic techniques will not 
affect the privacy-preserving properties of such schemes. 



2. RELATED WORK 

Several scholarly works have outlined query verification methods.
The vast majority of these works focus on specific types
of queries. Some focus only on selection [2] ij [10] [18] [20],
while others focus on relational queries such as selection,
projection, and joins [13] [12]. Still others focus only on aggregation
queries like sum, count, and average [sj [19] [21].
Some of these processes [16] [21] require different verification
schemes for each type of query, or even each individual
query, requiring that the subscriber knows which queries will 
be asked in advance. 

Many of the schemes require complex cryptographic proto- 
cols. Some encrypt the data itself, relying on homomorphic 
schemes to allow the cloud provider to perform the computation
[6] [19]. A homomorphic operation will always be less
efficient than the operation on the plaintext, rendering the 
overhead of these protocols greater by orders of magnitude. 
Others, such as [16], rely on relatively simple cryptographic
primitives, like secure hash functions. To maintain integrity, 
our scheme will also use hash functions. Our verification 
framework is, however, simpler than these, and can be used 
to improve the expected runtime of any of these verification 
schemes. 

The work of Canetti, Riva, and Rothblum fsl also makes 
use of multiple outsourcing services for query verification. 
However, they make use of all the services all the time, and 
require a logarithmic number of rounds to ensure verifiabil- 
ity of computation. In addition, they assume that at least 
one of the cloud providers is in fact honest. We, in contrast, 
do not assume that any provider is honest, merely that they
are rational (meaning that the provider wishes to maximize
his profits), and we only use additional providers a fraction
of the time. In addition, we only require one round of computation.

Figure 2: A simple game with a mixed strategy equilibrium

3. BACKGROUND 

Before delving into the depths of outsourced query verifi- 
cation, some background knowledge is necessary. We will 
require some basic knowledge of game theory. In addition, 
we will be making use of some basic cryptographic primitives 
to ensure data integrity. 

3.1 Game Theory 

Game theory, despite the misleading name, is a widely ac- 
cepted field of economic theory which studies competitive 
behavior. Competing parties are known as players, and the 
competition itself is known as a game. A game contains 
four basic elements: players, actions, payoffs, and informa- 
tion [15]. Players have actions which they can perform at 
designated times in the game, and as a result of the ac- 
tions in the game, players receive payoffs. The payoffs are 
represented as real-valued functions which depend on the 
actions chosen and the information surrounding the game. 
The players can have different pieces of information, which 
can have a tremendous impact on the outcome of the game. 
The players aim to use a profitable strategy to increase their 
payout. A player who acts in such a way as to maximize 
his or her payout, regardless of the effects on other play- 
ers, is called rational. Games take many forms, and vary 
in the four attributes mentioned above, but all games deal 
with them. The specific games we describe in this paper 
are finite player, two-step, incomplete information Bayesian 
games, with payouts based on the final result of players' 
actions. 

A game is said to be at equilibrium when no single player can 
unilaterally increase his or her payoff by changing his or her 
strategy. In such a scenario, no players have any incentive
to choose a different strategy. It was shown by Nash [11]
that all finite games (finitely many players, each with finitely
many actions) have an equilibrium, although
the equilibrium might require mixed strategies. A mixed 
strategy is a strategy in which players choose each of the 
available actions with a certain probability. For example, 
consider the game with two players, A and B. During the 
game, the players can choose either action X or action Y, 
and both players choose their actions simultaneously. If both 
players choose the same action, player A receives a utility of
1, while player B receives a utility of 0. Otherwise, player
B receives a utility of 1, and player A receives a utility of 0.
This game can be represented by the table in Figure 2.

Suppose player A's strategy is to always choose action X. 
Player B could then choose his action to be Y, and guarantee
himself a payout of 1. However, if this were the case,
then player A could simply alter his strategy to choose action
Y, thwarting player B's strategy. Suppose, however,
that player A's strategy is to flip a fair coin, choosing X if it
comes up heads and Y if it comes up tails. In this scenario,
no matter what player B chooses, player B's expected payout
is $1/2$. Player B can also choose to use this strategy. If
both players use this strategy, then the game is in equilibrium,
since neither player has any incentive to unilaterally
change strategy. This equilibrium is the only equilibrium 
of the game, and since the strategies are probabilistic, the 
equilibrium is a mixed strategy equilibrium. 

We can also frame the above game as a game with a pure 
strategy equilibrium, but with continuous actions. Instead 
of having the actions be X and Y, we allow each player to
select, as his action, a probability between zero and one that
he would select X. Let the probability that A chooses X
be $\alpha$, and the probability that B chooses X be $\beta$. As before,
the equilibrium is $\alpha = \beta = 1/2$. However, this equilibrium is
in pure strategies, since the action is now to choose the prob- 
ability, not the action as before. This could be considered 
an irrelevant distinction. However, it will prove to be useful 
in our game theoretic model. 
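
The indifference argument behind this equilibrium can be checked numerically. The following is a minimal sketch; the function name `expected_payoffs` is ours and does not appear in the paper. It computes both players' expected utilities when A plays X with probability alpha and B plays X with probability beta.

```python
def expected_payoffs(alpha, beta):
    """Expected utilities (u_A, u_B) in the two-action game of Figure 2.

    alpha: probability that player A chooses X.
    beta:  probability that player B chooses X.
    A receives utility 1 when the actions match; otherwise B does.
    """
    p_match = alpha * beta + (1 - alpha) * (1 - beta)
    return p_match, 1 - p_match


if __name__ == "__main__":
    # If A always plays X, B's best response (always Y) gives B a payoff of 1.
    print(expected_payoffs(1.0, 0.0))
    # If A flips a fair coin, B's expected payoff is 1/2 whatever B does.
    for beta in (0.0, 0.3, 1.0):
        print(beta, expected_payoffs(0.5, beta))
```

Because B's payoff does not depend on beta once alpha is 1/2 (and symmetrically for A), neither player can gain by deviating unilaterally.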

For behavior at an equilibrium to be considered rational, 
it must not only be incentive compatible, meaning that no 
player has any incentive to unilaterally deviate from that 
strategy, but it must also be individually rational. Individual 
rationality means that each player is expected to be no worse 
off than they were before they chose to participate in the 
game. More formally, it means that the payoffs for each 
player in the equilibrium have an expected value greater 
than or equal to zero. 



3.2 Cryptographic Primitives 

In order to maintain the integrity of our databases, we will 
need to employ some basic cryptographic primitives. We will 
need to employ a scheme that allows the owner to make sure 
that tuples he receives from the server are legitimate, and 
were not added or altered by the server. We can use a sim- 
ple message authentication code protocol known as HMAC 
to do this. HMAC requires the use of cryptographic hash 
functions. 

A cryptographic hash function or one-way hash function is a
function mapping a large, potentially infinite, domain to a finite
range. This function is simple to compute (taking polynomial
time), but is difficult to invert. In addition, we require
that, for a cryptographic hash function $f$, it is difficult
to find an $x$ and $y$ such that $x \neq y$ and $f(x) = f(y)$. Examples
of cryptographic hash functions include MD5, SHA-1,
and SHA-256.

The HMAC (Hash-based Message Authentication Code) system
creates a keyed hash function from an existing cryptographic
hash function. Let $m$ be the message we wish to create a
code for, and $k$ be the key we wish to use. Let $f$ be our
cryptographic hash function, and let its required input size
be $n$. If $k$ has a length smaller than $n$, we pad $k$ with zeroes
until it has size $n$. If $k$ is larger, we let $k$ be $f(k)$ for the
purposes of calculating the HMAC function. We define the
HMAC function as follows:

$$\mathrm{HMAC}(m, k) = f\big((k \oplus \mathit{outpad}) \,\|\, f((k \oplus \mathit{inpad}) \,\|\, m)\big)$$

where outpad and inpad are two constants which are the
length of $f$'s block size (in practice, 0x5c...5c and 0x36...36,
respectively).

Given a message m and its HMAC value h, if we have the 
key k, we can simply check to see if HMAC(m, k) matches
h. If it does, then the probability that the message is not 
legitimate (i.e., fabricated or altered), is negligible. 
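
The construction above can be written out directly and compared against a library implementation. This is a sketch under our own assumptions: we pick SHA-256 as the underlying hash $f$ (the paper does not fix a choice), and the function name `hmac_sha256` is hypothetical. Python's standard `hmac` and `hashlib` modules serve as the reference.

```python
import hashlib
import hmac


def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC(m, k) = f((k xor outpad) || f((k xor inpad) || m)) with f = SHA-256."""
    block_size = hashlib.sha256().block_size  # 64 bytes for SHA-256
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()    # long keys are hashed first
    key = key.ljust(block_size, b"\x00")      # short keys are zero-padded
    outer_key = bytes(b ^ 0x5C for b in key)  # k xor outpad
    inner_key = bytes(b ^ 0x36 for b in key)  # k xor inpad
    inner_digest = hashlib.sha256(inner_key + message).digest()
    return hashlib.sha256(outer_key + inner_digest).digest()


if __name__ == "__main__":
    key, row = b"owner-secret", b"(42, 'alice', 1300.00)"
    tag = hmac_sha256(key, row)
    # The hand-written construction agrees with the standard library.
    print(hmac.compare_digest(tag, hmac.new(key, row, hashlib.sha256).digest()))
```

In practice one would call `hmac.new` directly and compare tags with `hmac.compare_digest`; the explicit version is shown only to mirror the formula.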

4. THE MULTIPLE CLOUD CASE 

We first consider the case where multiple cloud providers 
exist, which do not collude. This means that the parties do 
not exchange strategies, and do not exchange information. 
We model the query verification process as a game. The 
game has the following characteristics: 

Players: Three players, the Data Owner ($O$), and two outsourced
servers ($S_1$ and $S_2$).

Actions: The data owner begins the game by selecting a
probability $\alpha$, and declares this probability to the servers.
He then sends the query ($Q$) to one of the two servers, with
equal probability. With probability $\alpha$ he also sends the
query to the other server. If server $S_i$ receives the query,
it then responds to the query with either $Q(D)$, that is,
the query result on the database $D$, or $Q'_i(D)$, which is some
result other than $Q(D)$. We apply the subscript $i$ to $Q'$
to indicate that one player's method of cheating is different
from the other player's method of cheating. We denote
the honest action as $h$, and the cheating action as $c$. These
actions are depicted in Figure 3.
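
The owner's move can be sketched as a small randomized routine. The function name `dispatch_query` and the use of integer server ids 1 and 2 are our own illustrative choices, not part of the paper.

```python
import random


def dispatch_query(alpha, rng):
    """Return the list of server ids (1 and/or 2) that receive the query."""
    primary = rng.choice([1, 2])      # each server is chosen with probability 1/2
    servers = [primary]
    if rng.random() < alpha:          # with probability alpha, cross-check
        servers.append(3 - primary)   # the other server also gets the query
    return servers


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, trials = 0.1, 100_000
    asked = sum(len(dispatch_query(alpha, rng)) for _ in range(trials))
    # The average number of servers paid per query is close to 1 + alpha.
    print(asked / trials)
```

Each server is asked with probability (1 + alpha) / 2, and the owner pays for 1 + alpha computations in expectation.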

Information: Data Owner $O$ has given his database $D$ to $S_1$
and $S_2$, with an HMAC message authentication code appended
to each tuple. Any message authentication scheme
would work here; its only purpose is to
maintain the integrity of the data. This means that the
servers cannot alter any tuples and cannot add any tuples
without being detected. The players have entered into an
agreement (a contract) before the game, and the contents of
this contract are known to all players. The contract could
contain the probability $\alpha$.

Payoffs: The owner receives the information value of the
results received, given by $I_v(Q)$, where $Q$ is either $Q(D)$ or
$Q'_i(D)$, minus the amount paid to the servers, $P(Q)$. The
servers receive this payment, minus the cost of computing
the query, $C(Q)$. For simplicity's sake, we assume that both
outsourcing services have the same cost of computation and
receive the same payment for the query. The logic below
easily applies to the case where costs are different, but this
assumption simplifies the equations involved. These payoffs
are additionally adjusted by the aforementioned contract.

We assume that $I_v(Q(D)) > (1+\alpha)P(Q)$ and $P(Q) > C(Q)$.
If this were not the case, then the game would not be individually
rational without some outside subsidies (that is,
some player's expected payout would be less than zero). In
essence, we want to ensure that the subscriber would want to
pay $(1+\alpha)P(Q)$ to receive the result, and the cloud provider
would accept $P(Q)$ for the computation. To do this, we
make sure that the value that the subscriber places on the
query is at least the expected payment, and the cost to the
cloud providers is no more than the amount they would be
paid. No one takes a loss on the transaction.

Figure 3: The Two-Cloud Query Verification System

We now present two contracts, both of which provide sim- 
ple solutions to the above game in which neither server has 
incentive to cheat. The first is very simple and requires 
no additional computation. The second is intuitively more 
fair, and thus more likely to be accepted in a real world 
scenario. Both contracts, however, would be accepted by 
rational players. 

Contract 1 If the owner asks for query responses from both
servers, and the results do not match, both servers pay a
penalty of $F$ to the owner, and return the money paid for
the computation $P(Q)$ as well.
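
Contract 1 amounts to a simple settlement rule. The sketch below is illustrative only: the function name `settle_contract_1` and the representation of results as a dictionary keyed by server id are our assumptions, and results are assumed to be hashable values that can be compared for equality.

```python
def settle_contract_1(results, payment, fine):
    """Amount each queried server owes the owner under Contract 1.

    results: dict mapping server id -> returned query result (one entry
             if only one server was asked, two if both were asked).
    payment: P(Q), the price paid per computation.
    fine:    F, the contractual penalty.
    """
    both_asked = len(results) == 2
    if both_asked and len(set(results.values())) > 1:
        # Results disagree: every server refunds P(Q) and pays the fine F.
        return {server: payment + fine for server in results}
    return {server: 0 for server in results}


if __name__ == "__main__":
    print(settle_contract_1({1: 42, 2: 42}, payment=10, fine=100))  # agreement
    print(settle_contract_1({1: 42, 2: 41}, payment=10, fine=100))  # mismatch
```

Note that the rule never needs to know which server was wrong, which is why this contract requires no additional computation by the owner.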

Theorem 4.1 The above game with contract 1 has an individually
rational, incentive compatible equilibrium in which
the servers behave honestly. 

Proof: Let $C(Q'_i)$ be the cost of computing $Q'_i$ for $S_i$. Note
that, because $S_1$ and $S_2$ do not collude, $S_1$ does not know
$Q'_2$ and $S_2$ does not know $Q'_1$. The only function both know
for sure is $Q$. Without additional knowledge, we can assume
that the probability that $Q'_1(D) = Q'_2(D)$ is negligible. For
a player to even consider returning $Q'_i$ instead of $Q$, we must
have $C(Q'_i) < C(Q)$, since a player will not cheat if they do
not gain anything from it. We also assume that $I_v(Q'_i(D)) <
0 < I_v(Q(D))$, since not only is the false result not what the
owner asked for, but it also appears to be the true result if not
verified. If the wrong answer is believed to be correct, this
would lead to wrong decisions, and ultimately, financial loss,
on the part of the owner. Now, we can define the expected
payoffs to each player, where $u_P(x, y)$ is the expected utility
for player $P$ when $S_1$ takes action $x$ and $S_2$ takes action $y$.
Note that, in these equations and throughout the rest of the
paper, we omit the argument $D$ from $Q$, since $D$ is fixed.
We begin with $O$. If both players are honest (equation (1)), $O$
receives the value of the information gained from the query,
minus the expected payment for the calculation, $1 + \alpha$ times
$P(Q)$. If one player is dishonest (equations (2) and (3)), then
with probability $\alpha$, $O$ detects this and gets both the honest
and the dishonest result and the fine $F$ from both players.
With probability $1 - \alpha$, he does not detect this, and gets
either the correct value or the incorrect value with equal
probability. In the event that both players cheat (equation
(4)), they are once again caught with probability $\alpha$, but in
this case, when they are not caught, $O$ receives only bogus
values. This results in the following equations:

$$u_O(h,h) = I_v(Q(D)) - (1+\alpha)P(Q) \tag{1}$$

$$u_O(h,c) = \alpha\big(2F + I_v(Q) + I_v(Q'_2)\big) + (1-\alpha)\big(\tfrac{1}{2}(I_v(Q) + I_v(Q'_2)) - P(Q)\big) \tag{2}$$

$$u_O(c,h) = \alpha\big(2F + I_v(Q) + I_v(Q'_1)\big) + (1-\alpha)\big(\tfrac{1}{2}(I_v(Q) + I_v(Q'_1)) - P(Q)\big) \tag{3}$$

$$u_O(c,c) = \alpha\big(2F + I_v(Q'_1) + I_v(Q'_2)\big) + (1-\alpha)\big(\tfrac{1}{2}(I_v(Q'_1) + I_v(Q'_2)) - P(Q)\big) \tag{4}$$

For the servers, if both servers are honest (equations (5) and
(8)), they receive the payment for the query, minus the cost of
the query, provided they are selected to perform the calculation.
This selection probability is why the equations below
contain $\tfrac{1}{2}$. Otherwise, they gain nothing and lose nothing.
If one player is dishonest, that player (equations (7) and (10)),
regardless of whether the other player is honest, with probability
$\alpha$ is caught, and loses the fine $F$. With probability
$1 - \alpha$, the player is not caught, and gains the payment $P(Q)$,
minus the cost of computing his cheat, $C(Q'_i)$, if he is chosen
for the computation. If a player is honest while the other
player is dishonest (equations (6) and (9)), that player similarly
is punished with probability $\alpha$, but invests a cost of
$C(Q)$ instead of $C(Q'_i)$ in the computation. This gives us
the following equations:

$$u_{S_1}(h,h) = \tfrac{1}{2}(1+\alpha)(P(Q) - C(Q)) \tag{5}$$

$$u_{S_1}(h,c) = \tfrac{1}{2}(1-\alpha)(P(Q) - C(Q)) - \alpha F \tag{6}$$

$$u_{S_1}(c,h) = u_{S_1}(c,c) = \tfrac{1}{2}(1-\alpha)(P(Q) - C(Q'_1)) - \alpha F \tag{7}$$

$$u_{S_2}(h,h) = \tfrac{1}{2}(1+\alpha)(P(Q) - C(Q)) \tag{8}$$

$$u_{S_2}(c,h) = \tfrac{1}{2}(1-\alpha)(P(Q) - C(Q)) - \alpha F \tag{9}$$

$$u_{S_2}(h,c) = u_{S_2}(c,c) = \tfrac{1}{2}(1-\alpha)(P(Q) - C(Q'_2)) - \alpha F \tag{10}$$

We can now find the $\alpha$ for which the expected value for $S_1$
is less when he cheats than when he is honest, assuming $S_2$
is honest. By symmetry, this will be the same for $S_2$. Thus,
we set:

$$\tfrac{1}{2}(1-\alpha)(P(Q) - C(Q'_1)) - \alpha F < \tfrac{1}{2}(1+\alpha)(P(Q) - C(Q))$$

Let $H$ represent the quantity $P(Q) - C(Q)$, and $H'$ represent
the quantity $P(Q) - C(Q'_1)$. Distribute the $(1+\alpha)$ and $(1-\alpha)$
to get:

$$\tfrac{1}{2}H' - \tfrac{\alpha}{2}H' - \alpha F < \tfrac{1}{2}H + \tfrac{\alpha}{2}H$$

Rearranging and combining terms, we get:

$$\tfrac{1}{2}(C(Q) - C(Q'_1)) < \alpha F + \alpha P(Q) + \tfrac{\alpha}{2}(C(Q) - C(Q'_1))$$

Let $G$ represent the quantity $C(Q) - C(Q'_i)$, that is, the
amount the server would gain from cheating. Substituting
this in and factoring out an $\alpha$, we get:

$$\tfrac{1}{2}G < \alpha\big(F + P(Q) + \tfrac{1}{2}G\big)$$

Multiplying through by two, we get:

$$G < \alpha(2F + 2P(Q) + G)$$

And, solving for $\alpha$,

$$\frac{G}{2F + 2P(Q) + G} < \alpha \tag{11}$$



Since we can define $F$ to be whatever we want in the contract,
we can make this minimum value of $\alpha$ arbitrarily small.
If $\alpha$ is at least this much, then $S_1$ (and by symmetry, $S_2$)
has no incentive to cheat. If $S_2$ is not honest, then $S_1$ has
no incentive to be honest, but the payout is less for both
(much less, if $F$ is large). Therefore, the best outcome is for
both players to behave honestly.
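
To illustrate how the fine drives the required verification probability down, the following sketch evaluates the threshold of equation (11) exactly as stated, for a few values of $F$. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def min_verification_probability(gain, payment, fine):
    """Threshold from equation (11): G / (2F + 2P(Q) + G).

    gain:    G = C(Q) - C(Q'), what a server saves by cheating.
    payment: P(Q), the price of one computation.
    fine:    F, the penalty written into the contract.
    """
    return gain / (2 * fine + 2 * payment + gain)


if __name__ == "__main__":
    # Illustrative numbers only: P(Q) = 10, G = 4.
    for fine in (0, 10, 100, 1000, 10000):
        print(fine, min_verification_probability(4, 10, fine))
```

The threshold is strictly decreasing in $F$ and tends to zero as $F$ grows, which is the sense in which the extra cost $\alpha P(Q)$ can be made a small fraction of the computation cost.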

Now, we need to show that choosing $\alpha$ is incentive compatible
for $O$. Given that both players are honest, $O$'s utility is
given as:

$$u_O(h,h) = I_v(Q(D)) - (1+\alpha)P(Q)$$

which, by our assumption, is greater than or equal to zero.
Thus, it is individually rational for $O$. If $\alpha$ is increased, it
merely decreases this value, so $O$ has no incentive to increase
$\alpha$. If we decrease $\alpha$, then $S_1$ and $S_2$ will start cheating. This
leads to:

$$u_O(c,c) = \alpha\big(2F + I_v(Q'_1) + I_v(Q'_2)\big) + (1-\alpha)\big(\tfrac{1}{2}(I_v(Q'_1) + I_v(Q'_2)) - P(Q)\big)$$

Now, since our $\alpha$ is less than our prescribed value in equation
(11), $F$ is bounded above by $\tfrac{1}{2}\big(\tfrac{G}{\alpha} - 2P(Q) - G\big)$. Because
of this, as $\alpha$ tends to zero, the first term of the above equation
decreases. The second term is negative (as $I_v(Q'_1)$ and
$I_v(Q'_2)$ are less than zero), and gets worse as $\alpha$ tends to zero.
Thus, if $\alpha$ is less than our prescribed value, $O$ expects to lose
value from cheating. So, $O$ has no incentive to deviate from
$\alpha = \frac{G}{2F + 2P(Q) + G}$.



Now, in practice, $O$ does not know $G$. Thus, he must choose
the smallest $\alpha$ that he knows he can use. Since $P(Q) >
C(Q) > G$, $O$ can choose $\alpha = \frac{P(Q)}{2F + 2P(Q) - P(Q)} = \frac{P(Q)}{2F + P(Q)}$.

As this is both incentive compatible and individually rational
for all players, this contract creates the best possible
equilibrium where $S_1$ and $S_2$ do not cheat, and $O$ pays only
$(1 + \alpha)$ times the price of a single computation (where $\alpha$ is
small). □
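
As a numerical sanity check, the server utilities of equations (5) and (7) can be compared at a verification probability that does not depend on $G$. The sketch below assumes the owner's choice is $\alpha = P(Q)/(2F + P(Q))$, which is our reading of the expression above; the function names and numbers are hypothetical.

```python
def u_server_honest(alpha, payment, cost):
    """Equation (5): expected utility when both servers are honest."""
    return 0.5 * (1 + alpha) * (payment - cost)


def u_server_cheat(alpha, payment, cheat_cost, fine):
    """Equation (7): expected utility of a cheating server."""
    return 0.5 * (1 - alpha) * (payment - cheat_cost) - alpha * fine


def owner_alpha(payment, fine):
    """Assumed G-free choice of alpha: P(Q) / (2F + P(Q))."""
    return payment / (2 * fine + payment)


if __name__ == "__main__":
    payment, cost, fine = 10.0, 6.0, 100.0   # illustrative values, P(Q) > C(Q)
    alpha = owner_alpha(payment, fine)
    for cheat_cost in (0.0, 2.0, 5.9):       # any cheaper-than-honest shortcut
        print(cheat_cost,
              u_server_honest(alpha, payment, cost),
              u_server_cheat(alpha, payment, cheat_cost, fine))
```

For these values the honest utility exceeds the cheating utility for every cheating cost below $C(Q)$, while the owner's expected overhead is only $\alpha P(Q)$.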

Now, it might seem unfair to punish both players when only 
one player cheats. The rational player would see the above 
contract as completely fair, but humans are not always com- 
pletely rational. Thus, we also briefly examine a contract 
which identifies the cheater and punishes only the cheater. 

Contract 2 If the owner asks for query responses from both 
servers, and the results do not match, the owner performs 
a potentially costly audit of the computation. Each server 
whose result does not match the result given by the owner's 
process pays a fine F to the owner. 

The audit process mentioned above could be done in sev- 
eral ways. The simplest, although most expensive, of these 
would be for the owner to retrieve all the data, then per- 
form the query himself. Obviously, this defeats the purpose 
of data outsourcing. Based on the fact that the outsourced 
data uses some message authentication codes to keep the 
data from being modified, we can improve this. First, for 
selection queries, if one player fails any MAC checks, then 
they are obviously cheating. If one player returns fewer re- 
sults than the other, then they are also obviously cheating. 
For aggregate queries, we can have each source return the 
tuples which were selected for the aggregation process. We 
can then check to see if the aggregate query result matches
the values returned by the server. Finding a tuple set that 
matches a false query result might prove incredibly difficult, 
if the false query was not generated from a sample. We can 
also apply the same techniques used for selection queries, 
noting that the cloud that returns fewer tuples must be 
cheating (provided all tuples returned are authenticated). 
Essentially, for a given query, we end up asking the clouds 
to "show their work," or face consequences. 
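
The two selection-query checks described above (MAC failures and missing tuples) can be sketched as follows. This is an illustration under our own assumptions: tuples are byte strings tagged with HMAC-SHA256, and the names `make_tag` and `audit_selection` are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, row: bytes) -> bytes:
    """MAC the owner attaches to a tuple before outsourcing it."""
    return hmac.new(key, row, hashlib.sha256).digest()


def audit_selection(key: bytes, answers: dict) -> set:
    """Return the ids of servers flagged as cheating on a selection query.

    answers maps a server id to the list of (row, tag) pairs it returned.
    """
    cheaters = set()
    for server, rows in answers.items():
        # A tuple whose MAC does not verify was added or altered.
        if any(not hmac.compare_digest(make_tag(key, row), tag)
               for row, tag in rows):
            cheaters.add(server)
    # Among servers whose tuples all verify, one that returns fewer
    # distinct tuples than another has omitted part of the result.
    counts = {server: len({row for row, _ in rows})
              for server, rows in answers.items() if server not in cheaters}
    if counts:
        most = max(counts.values())
        cheaters.update(s for s, n in counts.items() if n < most)
    return cheaters


if __name__ == "__main__":
    key = b"owner-secret"
    signed = [(r, make_tag(key, r)) for r in (b"r1", b"r2", b"r3")]
    print(audit_selection(key, {1: signed, 2: signed[:2]}))  # server 2 omitted a tuple
```

The owner needs only the MAC key for these checks, so the audit avoids retrieving the whole database in the common cases.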

Theorem 4.2 The above game under contract 2 also has 
an equilibrium where both servers remain honest. 

Proof: The main differences between this and the previous
scenario are the fact that an honest server will never pay a
fine, and that if a player is caught cheating, the owner must
perform a costly audit. We will call this cost $C(Q_O(D))$.
As the data is signed with the HMAC codes, the owner can
retrieve all of it from either server, guaranteed. As $u_{S_1}(h,h)$
and $u_{S_1}(c,h)$ do not change, the $\alpha$ may be located in the
same way. When one player cheats, the other player has
incentive to be honest, as it avoids the fine. Thus, $(h,h)$
is actually a dominant strategy in this game, when $\alpha$ is set
high enough. Now, recall that $F$ can be set as high as necessary.
If we double $F$ and increase it by the cost of the
audit $C(Q_O(D))$, then the payouts for $O$ would be the same
as in contract 1. So, by Theorem 4.1, this is also incentive
compatible for $O$.



As this is both incentive compatible and individually rational
for all players, this contract creates the best possible
equilibrium where $S_1$ and $S_2$ do not cheat, and $O$ pays only
$(1 + \alpha)$ times the price of a single computation (where $\alpha$
is small). Note that, since both $S_1$ and $S_2$ are honest, we
never expect to pay the cost of the audit. □

Note the generality of this result. In contrast with many 
other results, it works for any query on any database 
(with the caveat that the query is deterministic), and it 
works in only one round of computation. 

5. THE SINGLE CLOUD CASE 

Though the above scenario is quite simple and very efficient, 
it does require giving both money and data to multiple parties.
It might be that the cost of maintaining two cloud
services (due to storage costs and other overhead) is pro- 
hibitively expensive. A data owner might also want to min- 
imize the outside exposure of his data set. It is possible, 
then, to use a similar scheme to verify the result from a sin- 
gle cloud. For the single cloud case, we focus on a database 
with a single relation. The extension to include joins will be 
considered in future work. We once again cast the process 
of query verification as a game. The game has the following 
characteristics: 

Players: Two players, the Data Owner ($O$) and the outsourced
Server ($S$).

Actions: The data owner begins the game by selecting a
probability $\alpha$, and declares this probability to the server.
This probability $\alpha$ is the probability with which the result
of the query ($Q$) will be verified ($v$). With probability $1 - \alpha$,
the query will not be verified ($n$). After receiving this
probability, the server may choose to cheat ($c$), revealing
$q'_s = Q'(R)$, an incorrect result, or honestly ($h$) give the
result $q_s = Q(R)$. The data owner then verifies with
probability $\alpha$, first by performing a local evaluation, then, if
necessary, a full query audit.

Information: $O$ has given his database relation ($R$) to $S$,
along with authentication codes for each tuple (to prevent
modification). $O$ retains a sketch of the database ($R'$) which
will be used for verification. $O$ has a process $V(Q, q)$ which
determines whether the argument $q$ is equal to $q_s$ with high
probability, using the sketch $R'$. In addition, an auditing
method $A(Q, q, R)$ exists which will determine with certainty
whether the query was executed correctly, but is very expensive.
The players have entered into a contract before
the game, and the contents of the contract are known to all
players. This contract can adjust the payoffs below.
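
One possible shape for the local check $V(Q, q)$ is a simple random sample kept by the owner. The sketch below is our own illustration for a SUM query over a single numeric column; the function names, the relative tolerance, and the sample size are assumptions, not details given in this section.

```python
import random


def build_sketch(values, sample_size, seed=0):
    """Owner-side sketch R': a simple random sample taken before outsourcing."""
    return random.Random(seed).sample(values, sample_size)


def verify_sum(claimed_sum, sketch, relation_size, rel_tolerance):
    """V(Q, q) for Q = SUM: accept q if it is close to the sample estimate."""
    estimate = sum(sketch) * relation_size / len(sketch)
    return abs(claimed_sum - estimate) <= rel_tolerance * abs(estimate)


if __name__ == "__main__":
    relation = list(range(1, 1001))           # toy single-column relation R
    sketch = build_sketch(relation, 200)
    print(verify_sum(sum(relation), sketch, len(relation), 0.25))  # honest answer
    print(verify_sum(0, sketch, len(relation), 0.25))              # crude cheat
```

A looser tolerance lowers the chance of rejecting an honest answer but raises the chance of accepting a wrong one; a rejected answer would then go to the full audit $A(Q, q, R)$.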

Payoffs: Let $p_{tp}$ be the probability that $V(Q, q_s)$ returns
true, $p_{tn}$ be the probability that $V(Q, q'_s)$ returns false,
$p_{fp} = 1 - p_{tn}$, and $p_{fn} = 1 - p_{tp}$ (these are the probabilities
of true positives, true negatives, false positives, and
false negatives from $V$, respectively). Let $C(X)$ be the cost
of computing the expression $X$. Let $I_v(X)$ be the value of
the information given by $X$. The expected utilities (payoffs)
for each player, without the intervening contract, are
as follows:

When the owner decides not to verify, he simply receives the
value of the query result (honest or not), minus the amount
paid for the calculation, resulting in:

$$u_O(n, h) = I_v(q_s) - P(Q)$$
$$u_O(n, c) = I_v(q'_s) - P(Q)$$

Similarly, the server simply gains the amount paid, minus
the cost of the calculation:

$$u_S(n, h) = P(Q) - C(q_s)$$
$$u_S(n, c) = P(Q) - C(q'_s)$$

If the owner chooses to verify, he also pays the cost of
verification and, in the case of a failed V, the cost of
an audit. If the audit fails (which would only happen in the
case of a cheating server), he does not pay the price for the
calculation.

$$u_O(v, h) = I_v(q_s) - P(Q) - C(V(Q, q_s)) - p_{fn} \cdot C(A(Q, q_s, R))$$
$$u_O(v, c) = I_v(Q'(R)) - C(V(Q, q'_s)) - p_{tn} \cdot C(A(Q, q'_s, R)) - p_{fp} \cdot P(Q)$$

An honest server, in the case of verification, gets the
same payout he would without verification: the price
of the query minus the cost to calculate it. A cheating server
is only paid if he is not caught, so he is only paid in the case
of a false positive from V.

$$u_S(v, h) = P(Q) - C(q_s)$$
$$u_S(v, c) = p_{fp} \cdot P(Q) - C(q'_s)$$

Now, since O declares a verification probability in advance,
we can write the expected utilities as:

$$u_O(\alpha, h) = \alpha u_O(v, h) + (1 - \alpha) u_O(n, h)$$
$$u_S(\alpha, h) = \alpha u_S(v, h) + (1 - \alpha) u_S(n, h)$$
$$u_O(\alpha, c) = \alpha u_O(v, c) + (1 - \alpha) u_O(n, c)$$
$$u_S(\alpha, c) = \alpha u_S(v, c) + (1 - \alpha) u_S(n, c)$$



Note that, in practice, a payment might not be rendered
for every query; the server might instead charge a flat
fee for its services, or use some other payment structure. In
these cases, one could consider the total payments spread
out over the queries. The assumption that payment
is rendered for each query therefore does not invalidate our solution.

We make some assumptions about the values used above.
First, we assume that $I_v(q_s) > P(Q) > C(q_s)$. This
inequality is necessary for participation in the
game to be individually rational (it guarantees that
the best expected payoff for each player, assuming no one
cheats, is at least zero). Naturally, if the query were not
worth enough to the owner, he would not pay the price, and
if the price did not cover the cost of computation for the
server, he would not perform the calculation. Second, we
assume that as $q'_s$ approaches $q_s$, $C(q'_s)$ approaches $C(q_s)$.
This implies that it is difficult to compute a $q'_s$ such that
$V(Q, q'_s)$ is expected to return true. As $q'_s$ moves away
from $q_s$, the cost can decrease. This provides the initial
incentive for the server to cheat. These assumptions are
reasonable, since computing a value close to the actual result
becomes more and more like computing the actual result. For
example, if a cheating server were to run the query on a sample
of the data and extrapolate the result, the estimated result
would get more accurate as the sample size got larger, but the
computational resources used would also increase. We assume that
the costs of V and A are constant for a given Q (no matter
whether $q_s$ or $q'_s$ is provided to them as an argument). Finally,
we once again assume that $I_v(q'_s) \ll I_v(q_s)$, since the
result is not what O asked for. We also assume that
$C(A(Q, q'_s, R)) < I_v(q_s) - I_v(q'_s)$, since if the audit were
more costly than the amount of information supplied by the
query, the audit would not take place.
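To make the initial incentive to cheat concrete, the contract-free payoffs defined above can be evaluated numerically. The following is a minimal sketch, not part of the original model description: the class `Game`, the function names, and every number used with them are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Game:
    """Illustrative parameters; all values supplied are assumptions."""
    value_honest: float  # I_v(q_s)
    value_cheat: float   # I_v(q'_s)
    price: float         # P(Q)
    cost_honest: float   # C(q_s)
    cost_cheat: float    # C(q'_s)
    cost_verify: float   # C(V(Q, .)), assumed constant for a given Q
    cost_audit: float    # C(A(Q, ., R)), assumed constant for a given Q
    p_tp: float          # Pr[V accepts an honest result]
    p_tn: float          # Pr[V rejects a cheating result]


def server_utility(g, alpha, cheat):
    """Expected server payoff u_S(alpha, .) without any contract."""
    p_fp = 1.0 - g.p_tn
    if cheat:
        verified = p_fp * g.price - g.cost_cheat
        unverified = g.price - g.cost_cheat
    else:
        verified = unverified = g.price - g.cost_honest
    return alpha * verified + (1.0 - alpha) * unverified


def owner_utility(g, alpha, cheat):
    """Expected owner payoff u_O(alpha, .) without any contract."""
    p_fp = 1.0 - g.p_tn
    p_fn = 1.0 - g.p_tp
    if cheat:
        verified = (g.value_cheat - g.cost_verify
                    - g.p_tn * g.cost_audit - p_fp * g.price)
        unverified = g.value_cheat - g.price
    else:
        verified = (g.value_honest - g.price - g.cost_verify
                    - p_fn * g.cost_audit)
        unverified = g.value_honest - g.price
    return alpha * verified + (1.0 - alpha) * unverified
```

With made-up values such as a price of 50, an honest cost of 40, and a cheating cost of 5, a server that is verified only 10% of the time earns more in expectation by cheating than by answering honestly. This is the gap that the contract in the proof of Theorem 5 is meant to close.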



We now outline a contract which removes the incentive for
the server to cheat. It is quite similar to the contract for the
two-cloud case.

Contract 3 If the owner chooses to verify, and it is determined
that the server has cheated, the server pays a penalty
of $F + C(A(Q, q'_s, R))$. (Note: We explicitly force a cheating
server to pay the audit cost.)

Theorem 5 The game, using the above contract (depicted
in figure 4), has an equilibrium in pure strategies. O will
select an $\alpha$ which makes cheating less profitable (in expectation)
than correctly revealing the result. S chooses not to cheat.

Proof: The above contract makes the following changes to
the original utilities:

$$u_O(v, c) = I_v(q'_s) - C(V(Q, q'_s)) + p_{tn} \cdot F - p_{fp} \cdot P(Q)$$
$$u_S(v, c) = p_{fp} \cdot P(Q) - C(q'_s) - p_{tn} \cdot (C(A(Q, q'_s, R)) + F)$$

Recall that the expected payouts are:

$$u_O(\alpha, h) = \alpha u_O(v, h) + (1 - \alpha) u_O(n, h)$$
$$u_S(\alpha, h) = \alpha u_S(v, h) + (1 - \alpha) u_S(n, h)$$
$$u_O(\alpha, c) = \alpha u_O(v, c) + (1 - \alpha) u_O(n, c)$$
$$u_S(\alpha, c) = \alpha u_S(v, c) + (1 - \alpha) u_S(n, c)$$

We want to find the $\alpha$ such that $u_S(\alpha, h) > u_S(\alpha, c)$.
Substituting in, we get:

$$u_S(\alpha, h) = \alpha (P(Q) - C(q_s)) + (1 - \alpha)(P(Q) - C(q_s)) = P(Q) - C(q_s)$$
$$u_S(\alpha, c) = \alpha (p_{fp} P(Q) - C(q'_s) - p_{tn}(C(A(Q, q'_s, R)) + F)) + (1 - \alpha)(P(Q) - C(q'_s))$$

So, we have the inequality (after multiplying through by $\alpha$
and $1 - \alpha$):

$$P(Q) - C(q_s) > \alpha p_{fp} P(Q) - \alpha C(q'_s) - \alpha p_{tn}(C(A(Q, q'_s, R)) + F) + P(Q) - C(q'_s) - \alpha P(Q) + \alpha C(q'_s)$$

Rearranging terms, we get:

$$P(Q) - P(Q) + C(q'_s) - C(q_s) > \alpha (p_{fp} P(Q) - C(q'_s) - p_{tn}(C(A(Q, q'_s, R)) + F) - P(Q) + C(q'_s))$$

Canceling out like terms, we get:

$$C(q'_s) - C(q_s) > \alpha (p_{fp} - 1) P(Q) - \alpha p_{tn}(C(A(Q, q'_s, R)) + F)$$

Now, since Q' is easier to compute than Q and $p_{fp} < 1$, both
sides of this inequality are negative. We therefore multiply
both sides by $-1$ and simplify to get the following:

$$C(q_s) - C(q'_s) < \alpha ((1 - p_{fp}) P(Q) + p_{tn}(C(A(Q, q'_s, R)) + F))$$

Since $1 - p_{fp}$ is equal to $p_{tn}$, we can substitute in $p_{tn}$, then
divide by the coefficient of $\alpha$, yielding the final expression:

$$\alpha > \frac{C(q_s) - C(q'_s)}{p_{tn}(C(A(Q, q'_s, R)) + F + P(Q))}$$

When $\alpha$ increases, the payout for cheating decreases, provided
$C(A(Q, q'_s, R))$ and F are large enough. So, as long as
the expression above is satisfied, the server will choose not
to cheat.

Now, while $C(A(Q, q'_s, R))$ is fixed, F is something that can
be adjusted in the contract! Therefore, if the penalty F is
astronomically high, we can severely reduce $\alpha$, while maintaining
that there is no incentive to cheat for S. This is
what is known as a "boiling-in-oil" contract.
We must also show that this $\alpha$ is incentive compatible for
O. Consider what happens when $\alpha$ is increased. If $\alpha$ is
greater than the above value, O ends up verifying more,
while S continues telling the truth. Because of this, O loses
valuation. So, O will not choose an $\alpha$ higher than this. If $\alpha$ is
less than this value, then S will start cheating! The possible
increase in payout to O would be $\alpha F$, but since $\alpha$ is small,
and $I_v(q_s)$ is so much greater than $I_v(q'_s)$, this would likely
result in a decrease in payout for O. Therefore, O will not
choose an $\alpha$ less than the above expression either. Thus, we
have an equilibrium. □
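The final bound from the proof, that the verification probability must exceed the cheating savings divided by $p_{tn}$ times the sum of audit cost, penalty, and price, is easy to tabulate. The sketch below is only an illustration of how the bound shrinks as the penalty F grows; the function name and all numeric values are assumptions.

```python
def min_verification_probability(cost_honest, cost_cheat, p_tn,
                                 cost_audit, penalty, price):
    """Threshold that the verification probability must exceed:

        (C(q_s) - C(q'_s)) / (p_tn * (C(A) + F + P(Q)))
    """
    if not 0.0 < p_tn <= 1.0:
        raise ValueError("p_tn must be in (0, 1]")
    saving = cost_honest - cost_cheat
    return saving / (p_tn * (cost_audit + penalty + price))


# Hypothetical numbers: the bound falls as the contractual penalty rises.
for penalty in (0, 100, 1_000, 10_000):
    bound = min_verification_probability(40, 5, 0.9, 20, penalty, 50)
    print(f"F = {penalty:>6}: verification probability > {bound:.4f}")
```

Under these assumed values, the bound drops from above one half with no penalty to well under one percent with a penalty of 10,000, which is the effect the "boiling-in-oil" argument relies on.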

6. IMPLEMENTATION DETAILS 

The game outlined above is fairly general, and allows for any
local verification method V to be used. Here, we outline a
simple sampling verification method which becomes much
more viable when the verification process is not being run
with every query. First, let us assume that the data consists
of N signed tuples, each of which has a unique, consecutive
id from 1 to N. Let O maintain some sample of size k of
these N tuples, together with the value N. This sample is
selected uniformly at random from the entire data set, with
replacement. This sample can be used to compute $V(Q, q'_s)$
for many different types of queries. For aggregate queries
such as count, sum, average, standard deviation, etc., one
could simply evaluate the query on the k tuples, and
extrapolate based on N. If this sample value is within some
$\epsilon$ of the query result $q'_s$, then we declare the result correct.
Otherwise, we perform the audit.
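The sampling check for aggregate queries described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: tuples are assumed to be Python dicts, signature checking is omitted, and the function names are hypothetical.

```python
import random


def draw_sample(tuples, k, rng):
    """Uniform sample of k tuples, drawn with replacement."""
    return [rng.choice(tuples) for _ in range(k)]


def estimate(sample, n_total, kind, attr, predicate=lambda t: True):
    """Evaluate an aggregate on the sample and extrapolate using N."""
    matching = [t[attr] for t in sample if predicate(t)]
    scale = n_total / len(sample)
    if kind == "count":
        return scale * len(matching)
    if kind == "sum":
        return scale * sum(matching)
    if kind == "avg":
        # An average needs no scaling; None signals "no matching tuples".
        return sum(matching) / len(matching) if matching else None
    raise ValueError(f"unsupported aggregate: {kind}")


def verify(claimed, sample_value, epsilon):
    """V(Q, q): accept when the claim is within epsilon of the estimate."""
    return sample_value is not None and abs(claimed - sample_value) <= epsilon
```

A rejected result (a return value of `False`) does not by itself prove cheating; in the scheme above it only triggers the full audit.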



For selection queries, note that because each tuple is signed,
we know that the server cannot modify any tuples, nor can it
insert new tuples. It can only either remove relevant tuples
from the result, or insert irrelevant tuples into the result. If
the server inserts irrelevant tuples, this can be easily detected
by O by simply noting that the tuple does not match the
query. Thus, it is only difficult to verify when a tuple has
been left out. As before, we can perform the selection query
on the sample of k tuples, and extrapolate the number of
tuples that should be returned by $q_s$. If the number of
tuples in $q'_s$ is within some $\epsilon$ of this estimate, we declare
the result correct. If the number of tuples in $q'_s$ is greater
than our estimate, then we should also declare the result
correct, since a greater number of matching tuples cannot be
wrong. Otherwise, we perform the audit.
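The selection-query check can be sketched in the same style. The code below is an illustration under stated assumptions: `signature_ok` is a hypothetical placeholder for the per-tuple authentication check, and detection of duplicated tuples (which the unique ids would allow) is left out for brevity.

```python
def verify_selection(returned, predicate, sample, n_total, epsilon,
                     signature_ok=lambda t: True):
    """Sampling check for a selection result; False means "audit"."""
    # Forged or irrelevant tuples are detected directly.
    for t in returned:
        if not signature_ok(t) or not predicate(t):
            return False
    # Extrapolate how many tuples the full relation should return.
    hits = sum(1 for t in sample if predicate(t))
    expected = hits * n_total / len(sample)
    # More tuples than expected is never suspicious; a shortfall of
    # more than epsilon is.
    return len(returned) >= expected - epsilon
```

The one-sided comparison mirrors the text: once every returned tuple has been checked against the query, only a shortfall in the number of tuples can indicate cheating.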

There are plenty of other methods used to verify queries on 
outsourced data, and any of them would work as a verifica- 
tion method V in our scheme. We choose this one, however, 
because of its simplicity. Note that it does not require ex- 
pensive cryptographic operations. 

One thing remains in the definition of the verification mechanism,
and that is the definition of $\epsilon$. The selection of k
tuples can be considered a selection of k random variables
$X_1, \ldots, X_k \in R$. In each case we are interested in a function
$f$ which maps $R^k \to \mathbb{R}$, and any alteration in a given
$X_i$ can only change the value of the aggregate function by
at most some $c_i$ (this $c_i$ is 1 for count, the max value of
the given attribute for sum, the max value squared for standard
deviation, etc.). We can therefore apply McDiarmid's inequality
[9], giving us the following:

$$\Pr\{|E[f(X_1, \ldots, X_k)] - f(X_1, \ldots, X_k)| \geq \epsilon\} \leq 2e^{-2\epsilon^2 / \sum_{i=1}^{k} c_i^2}$$

Note that this inequality does not depend on the value of N.
It simply depends on the sample size. For example, say
we want to devise a sample size k such that an average query
on attribute a of the sample is within $\epsilon$ = 1% of the true
result with probability .999. Here $c_i$ is given as $\max\{|a|\}/k$.
The probability in the above works out to

$$2e^{-.0002 \cdot \mathrm{average}\{a\}^2 / (\max\{|a|\}^2 / k)}$$

We want this to be less than or equal to .001. Solving for k, we get

$$k \geq \frac{-\ln(.0005) \cdot \max\{|a|\}^2}{.0002 \cdot \mathrm{average}\{a\}^2}$$

This gives us a value of approximately 38004.51 times the
square of the maximum value of the attribute a, divided by the
square of the average. Since the average is no larger in magnitude
than the maximum value of a, this ratio is at least 1: k is about
38 thousand tuples when the average is close to the maximum, and
grows as the average shrinks relative to the maximum, depending
on the distribution of a. In either case the bound does not grow
with the number of tuples, even if that number is in the millions.
Note that this does not help the data owner find the value of k,
as the owner does not know the actual result. This merely
shows that a good k exists, and it is independent of the
number of tuples in the dataset for many common queries.
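The sample-size calculation for the average query can be reproduced with a few lines of code. This is a sketch of the arithmetic only, assuming the per-variable bound $c_i = \max\{|a|\}/k$ stated in the text; the function name and its parameters are hypothetical.

```python
import math


def mcdiarmid_sample_size(rel_error, failure_prob, max_abs, average):
    """Smallest k with 2 * exp(-2 * eps^2 * k / max_abs^2) <= failure_prob,
    where eps = rel_error * |average| and c_i = max_abs / k."""
    if average == 0:
        raise ValueError("relative error is undefined for a zero average")
    eps = rel_error * abs(average)
    k = math.log(2.0 / failure_prob) * max_abs ** 2 / (2.0 * eps ** 2)
    return math.ceil(k)


# 1% relative error with probability .999, average equal to the maximum.
print(mcdiarmid_sample_size(0.01, 0.001, max_abs=1.0, average=1.0))
```

Note that the number of tuples N never appears among the arguments: under this bound, the required k depends only on the error tolerance, the failure probability, and the ratio between the maximum and the average of the attribute.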

38,000 tuples is not a particularly small number, especially
compared with some sketches that use only three bytes [21]. However,
this sample can be used to verify many different types of
queries, and the owner does not have to plan for the queries in advance.
In addition, the verification will only be performed a fraction
($\alpha$) of the time. This fraction, through the use of the penalty
in the contract, can be made arbitrarily small, leading to a
very fast expected runtime.

Now, one method of generating a false query result (for aggregate
queries) might be to use the same sampling method as
above. Note, however, that in order to ensure that the sample
chosen by the server has a result within $\epsilon$ of the result
of the owner's sample, the server would need a much tighter
$\epsilon$.

With some probability $\delta$, the owner's sample result is within
$\epsilon$ of the correct result. This is also true for the server.
However, consider the worst case where the owner's sample value
is $Q(R) - \epsilon$ and the server's sample value is $Q(R) + \epsilon$. The
probability $\delta$ is not sufficient to bound the difference between
the two sample values. To ensure that with probability $\delta$ the
owner's sample result is within $\epsilon$ of the result returned by
the server, the server would have to return the actual result,
as any leeway would lead to a worst case scenario where the
difference is greater than $\epsilon$.

In order to prevent sampling bias, a protocol could be im- 
plemented to resample from the data. As each tuple has 
a unique id, the owner could, at some interval, request the 
tuples with a given set of ids. The owner would know if the 
tuples he desired were not returned. In order to prevent the 
server from learning the exact sample (which would lead to 
the server simply using that sample for calculations), the 
owner would select some dummy tuples, or in some cases, 
the entire data set. A similar method, selecting all tuples 
involved, can be employed for auditing the queries. 
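One way the resampling request could be assembled is sketched below. This is an assumption-laden illustration, not a protocol specified in the text: the function names are hypothetical, tuples are assumed to be dicts carrying an `id` field, and the seeded `random.Random` generator is used only to keep the example reproducible (a real deployment would want an unpredictable source such as `random.SystemRandom`).

```python
import random


def build_resample_request(n_total, k, n_dummies, rng):
    """Return (request_ids, kept_ids) for one resampling round.

    kept_ids are the k ids (drawn with replacement) that form the new
    sample; request_ids adds dummy ids and is shuffled so that the
    server cannot tell which tuples the owner actually keeps.
    """
    kept_ids = [rng.randint(1, n_total) for _ in range(k)]
    wanted = set(kept_ids)
    pool = [i for i in range(1, n_total + 1) if i not in wanted]
    dummies = rng.sample(pool, min(n_dummies, len(pool)))
    request_ids = sorted(wanted) + dummies
    rng.shuffle(request_ids)
    return request_ids, kept_ids


def missing_ids(request_ids, returned_tuples):
    """Ids the server failed to return (signatures assumed checked)."""
    returned = {t["id"] for t in returned_tuples}
    return sorted(set(request_ids) - returned)
```

Requesting the entire data set corresponds to making every id outside the sample a dummy, which hides the sample completely at the cost of transferring all N tuples.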

7. EXPERIMENTS 

To test the effectiveness of the sampling protocol for catching
cheating on real data, we ran a series of experiments.
The mechanisms outlined in sections 4 and 5 do not need
any experimentation, as they are proven and mathematically
sound. These experiments were designed to show that the
sampling technique can identify cheating with a non-trivial
probability. Other verification methods will work similarly
in our framework, as long as they can identify cheating with
non-trivial probability. For example, if a simple method exists
to verify a certain query deterministically, then it could
be called in place of our sampling scheme, and would allow
our $\alpha$ to be even smaller. The sampling protocol is important,
however, due to its generality and simplicity.

7.1 Methodology 

We used the US Census 1990 data set from the UC Irvine 
Machine Learning repository, which contains over 2.3 million 
tuples [1]. We focused on a few major fields in this data set. 

We processed results for eight different aggregation queries 
of varying types. Since selection queries can be estimated 
via counts, we chose to focus on aggregation queries. 

The query types are as follows: 



1. Count, equality selection (count the people whose race 
value is 2-black) 

2. Count, range selection (count the people whose income 
is greater than 40000) 

3. Count, range and equality conjunction (count the peo- 
ple who are over age 30 and never married) 

4. Count, range disjunction (count the people who are 
under age 18 or have an income of less than 10000) 

5. Sum, equality selection (find the sum of the incomes 
of all people who never married) 

6. Sum, range and equality conjunction (find the sum 
of the incomes of all people who are over age 40 and 
whose place of birth is the place they work) 

7. Average, range selection (find the average age of all 
people who have an income greater than 80000) 

8. Average, equality conjunction, sparse result (find the 
average income of all people who are male and of race 
9-Japanese) 

For each query, we ran 100 trials, estimating the full result
of the query with five different sample sizes: 1000 tuples,
5000 tuples, 10000 tuples, 20000 tuples, and 40000 tuples.
As above, these samples are selected uniformly at random
with replacement. We determined the likelihood that each
sample would accept the correct value for varying values of
$\epsilon$ from 0 to .5r, where r is the estimated result. Since the
verification process would not know the actual result, we
based $\epsilon$ on the estimated result given by the sample, as
we expected it to be close to the actual result.

We then ran the samples against different means of falsifying 
the result, to determine if the sample method could catch a 
cheater. The first type of falsification was the same as our 
verification technique, sampling the data. We once again 
ran 100 trials for 1000, 5000, 10000, 20000, and 40000 tuple 
samples, both for detecting the cheating and for creating the 
cheating. 

The second type of falsification was a "worst case" falsification,
where the exact result was computed, but then Laplace
noise was added to the final result. An adversary would
never actually do this, as it would be more expensive than
simply computing the result itself, but this provides a way to
test our scheme beyond the normal means. The mean of the
Laplace noise was of course the result itself (which we will
call r), whereas the width parameter took the values r/5,
r/10, r/20, and r/50. We chose Laplace noise as opposed
to any other type of noise because it is used in differential
privacy as a means of masking query results while still
achieving meaningful results [5]. Each of these noise settings
was run for 100 trials against each sample size as before.
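The noisy "worst case" cheater can be simulated with the standard library alone. The sketch below is an illustration, not the code used for the experiments: it draws Laplace noise as the difference of two exponential variates, and the function names and the way rejection is counted are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_cheater(true_result, divisor, rng):
    """Falsified result: Laplace noise centred on r with width r / divisor."""
    return true_result + laplace_noise(abs(true_result) / divisor, rng)


def detection_rate(true_result, divisor, sample_values, epsilon_frac, rng):
    """Fraction of owner estimates that reject a noisy result.

    This is an empirical stand-in for p_tn; epsilon is taken as a
    fraction of the owner's own estimate, as in the methodology.
    """
    rejected = 0
    for estimate in sample_values:
        claimed = noisy_cheater(true_result, divisor, rng)
        if abs(claimed - estimate) > epsilon_frac * abs(estimate):
            rejected += 1
    return rejected / len(sample_values)
```

Here `sample_values` would hold the owner's estimates from repeated trials, one per sample drawn; sweeping `epsilon_frac` and recording the rejection rates for honest and falsified results would trace out one curve of the kind reported in the results.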

7.2 Results 

Space restrictions prevent the inclusion of every graph gener- 
ated by the experiments. However, if we examine one factor 
at a time, we can show the general trend of the sampling 
protocol to correctly or incorrectly identify cheating values. 
The omitted graphs show similar trends. 



Query Type Figure 5 shows the ROC (receiver operating
characteristic) curves for each of the eight queries, for a sample
size of 10000, against every type of result falsification we
used. These ROC curves show the tradeoff between the
probability of a false negative and the probability of a true
negative. The queries themselves all behave similarly. At a
sample size of 10000, we can always find an $\epsilon$ where some
nontrivial fraction of cheating will be caught. There is always
a tradeoff, however. As $\epsilon$ decreases, more legitimate
results will be marked as wrong, and forced to be fully audited.
Proper use of the sampling technique involves careful
selection of $\epsilon$ in order to increase $p_{tn}$, while reducing
$p_{fn}$ as much as possible. In practice, it is more important
that $p_{tn}$ be high, since we can mitigate the effect of false
positives by increasing the penalty, thereby decreasing $\alpha$ and
reducing the number of times that we do the verification.
$p_{tn}$ is acceptably high for fairly small $\epsilon$ (0.00125r to 0.005r).

Sample Size Figure 6 shows the ROC curves for query 2 for 
each sample size used. Clearly, as the sample size increases, 
the potential for better choices of epsilon increases. With a 
sample size of 40000, there is even a type of cheating (r/5 
noise) that allows for a false negative rate of .02 and a true 
negative rate of .95! The obvious tradeoff here, though, is 
that while you will do fewer full audits with a larger sample 
size, the verification process will take more resources. The 
smaller sample sizes still have the ability to catch cheating, 
but they will end up auditing many more legitimate results. 

Cheater Sample Size Figure 7 shows the ROC curves
for query 3, against cheaters using sampling only. We can
clearly see that the curves move down and to the right as
our adversary's sample size increases. This makes sense:
as the cheater gets better at impersonating the correct result,
it becomes more difficult to distinguish the incorrect
results from the correct ones. However, in every case, even
with a sample size of 1000, we are able to detect cheating
better than random guessing. Keep in mind that, in order
to be useful, we merely need to be able to detect cheating
with some non-negligible probability, and that any means
we choose to do that is acceptable.

Cheater Laplace Noise Figure 8 shows the same query 
as figure 7, but this time with our cheater only using the 
Laplace noise. Surprisingly, the noise addition version ends 
up being easier to catch. This is due to the fact that the 
parameter on the Laplace noise is large enough to cause 
issues. Still, at r/50, the cheating becomes quite difficult to 
detect for low sample size. 

8. CONCLUSIONS 

In summary, by thinking about the problem of query ver- 
ification from a different perspective, namely, that of an 
economist, we can drastically reduce the computation re- 
quired to ensure that the result you asked for is the result 
you received. The various query verification methods that 
are out there are still quite useful, however. Specialized ver- 
ification methods which take up very little space work well 
for common queries, and in our game-theoretic framework, 
would only be required to run a fraction of the time. They 
are, however, not generic and can rely on some expensive 
operations. No matter what sort of verification method we
choose, our contract-based computation vastly simplifies the
overall process of query verification.

Figure 5: ROC Curves for the 8 Query Types: Sample Size 10000

Figure 6: ROC Curves for Five Sample Sizes: Query 2

Figure 7: ROC Curves for Five Sample Sizes: Query 3, Sampling Cheater Only

Figure 8: ROC Curves for Five Sample Sizes: Query 3, Noisy Cheater Only

8.1 Future Work 

In the future, we will consider other types of verification 
methods, and how they might be better served by the game- 
theoretic framework outlined here. In addition, we will con- 
sider joins, which are disproportionately resource intensive 
compared to other database operations, possibly resulting 
in a need for a revised incentive structure. 